Abstract: A Bag-of-Distributed-Tasks (BoDT) application is a collection
of identical and independent tasks, each of which requires a piece of
input data located somewhere around the world. As a result, Cloud
computing offers an effective way to execute such applications, as it
not only consists of multiple geographically distributed data centres
but also allows a user to pay only for what she actually uses. In this
paper, we consider executing BoDT on the Cloud using virtually unlimited
cloud resources. A heuristic algorithm is proposed to find an execution
plan that takes budget constraints into account. Compared with other
approaches, with the same given budget, our algorithm is able to reduce
the overall execution time by up to 50%.


1 INTRODUCTION 

A Bag-of-Tasks (BoT) is a collection of identical and independent
tasks. In other words, the tasks of a BoT application can be executed
by the same application but in any order. Bag-of-Distributed-Tasks
(BoDT) is a subset of BoT in which each task requires data from
somewhere around the globe. The location where a task is executed is
essential for keeping the execution time of the BoDT low, since data is
transferred from a geographically distributed location. It is ideal to
assign tasks to locations that are in close geographical proximity to
the data.

The centralised approach for executing BoDT, in which data from
multiple locations is transferred to and processed at a single
location, tends to be ineffective since some data resides very far from
the selected location and takes a long time to download. Another
approach is to group the tasks of the BoDT in such a way that each
group can be executed near the location of its data. However, this
approach requires an infrastructure which is decentralised and globally
distributed. Cloud computing is ideally suited for this since public
cloud providers have multiple data centres which are globally
distributed. Furthermore, due to its pay-as-you-go scheme, using Cloud
computing is cost effective as a user only pays for the Virtual
Machines (VMs) that are required.

Cloud computing can facilitate the execution of BoDT, but at the same
time it introduces the challenge of assigning tasks to VMs by
considering the location for processing each task, the user’s budget
constraint, as well as the desired performance, i.e. execution time,
for executing the tasks. In the ideal case, maximum performance is
obtained while minimising the cost.

In our previous paper (Thai et al., 2014b), we approached this problem
by assuming limited resources were available. However, as Cloud
providers offer virtually unlimited resources, the limit should be
determined based on the user’s budget constraint. In this paper, we
present our approach for executing BoDT on the Cloud using virtually
unlimited resources, limited only by a user-specified budget
constraint. Compared with other approaches, with the same given budget,
our algorithm is able to reduce the overall execution time by up to
50%.

The contributions of this paper are i) the complete mathematical model
of executing a BoDT application on the Cloud with a budget constraint,
ii) the heuristic algorithm which assigns tasks to Cloud resources
based on their geographical locations, and iii) the evaluation
comparing our approach with the centralised and the round robin
approaches.

The remainder of the paper is structured as follows. Section 2 presents
the mathematical model of the problem. Section 3 introduces the
heuristic algorithms producing an execution plan based on the user’s
budget constraint. Section 4 evaluates the approach. Section 5 presents
the related work. Finally, Section 6 concludes the paper.




2 PROBLEM MODELLING 


Let $L = \{l_1, \ldots\}$ be the list of Cloud locations, i.e. the
locations of the Cloud provider’s data centres, and
$VM = \{vm_1, \ldots\}$ be the list of Cloud VMs. For $vm \in VM$,
$l_{vm} \in L$ denotes the location in which $vm$ is deployed. Let
$VM_l \subset VM$ be the list of all VMs deployed at location
$l \in L$. The number of items in $VM$ is not fixed since a user can
initiate as many VMs as needed.

Let $T = \{t_1, \ldots, t_n\}$ be the list of tasks, and $size_t$
denote the size of a task. The time (in seconds) taken to transfer one
unit of data from a task’s location to a Cloud location is denoted as
$trans_{t,l}$. Similarly, $trans_{t,vm}$ for $vm \in VM$ is the cost of
moving $t$ to $vm$ (or to the location in which $vm$ is running;
$trans_{t,vm} = trans_{t,l_{vm}}$). We assume that only one type of VM
is used; hence, the cost of processing one unit of data is identical
for all VMs and is denoted as $comp$.

The time taken to execute task $t$ at $vm$ is:

$$exec_{t,vm} = (trans_{t,vm} + comp) \times size_t \qquad (1)$$

Let $T_{vm} \subset T$ be the list of tasks executed in $vm \in VM$.
All tasks must be executed, which is represented as the following
constraint:

$$\bigcup_{vm \in VM} T_{vm} = T \qquad (2)$$

One task should not be executed in more than one location, which is
expressed as an additional constraint:

$$T_i \cap T_j = \emptyset \text{ for } i, j \in VM \text{ and } i \neq j \qquad (3)$$

The execution time of all tasks on $vm \in VM$ is:

$$exec_{T_{vm}} = \sum_{t \in T_{vm}} exec_{t,vm} \qquad (4)$$

As it takes some time to create a VM, the overhead associated with the
start-up of each VM is denoted as $start\_up$. The execution time of
$vm \in VM$ to execute all tasks in $T_{vm}$ is:

$$exec_{vm} = start\_up + exec_{T_{vm}} \qquad (5)$$

It should be noted that Equation 5 can only be applied if there are
task(s) assigned to a VM, i.e. $T_{vm} \neq \emptyset$. Otherwise, it
is unnecessary to create a VM, thus its execution time is zero.

Assuming each VM is charged by the hour, i.e. 3600 seconds, the number
of charged time blocks is:

$$tb_{vm} = \left\lceil \frac{exec_{vm}}{3600} \right\rceil \qquad (6)$$


Equation 6 contains the ceiling function, which means the execution
time is rounded up to the nearest hour in order to calculate the number
of used time blocks. In other words, a user has to pay for a full hour
even if only a fraction of the hour is used.
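
Equations 1, 4, 5 and 6 can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function names,
the `(trans, size)` pair representation of a task and the numeric
values are hypothetical and are not taken from the experiments.

```python
import math

BLOCK_SECONDS = 3600  # one charged time block (one hour)


def exec_task(trans, comp, size):
    """Equation 1: execution time of one task on a VM."""
    return (trans + comp) * size


def exec_vm(tasks, comp, start_up):
    """Equations 4 and 5: start-up overhead plus the sum of task times.

    tasks is a list of (trans, size) pairs for the tasks assigned to the VM.
    A VM with no tasks is never created, so its execution time is zero.
    """
    if not tasks:
        return 0.0
    return start_up + sum(exec_task(trans, comp, size) for trans, size in tasks)


def time_blocks(exec_seconds):
    """Equation 6: number of charged blocks, rounded up to whole hours."""
    return math.ceil(exec_seconds / BLOCK_SECONDS)


# Illustrative values only, not measurements from the experiments.
tasks = [(0.5, 1000.0), (0.25, 2000.0)]  # (seconds per data unit, data units)
total = exec_vm(tasks, comp=0.5, start_up=100.0)
print(total, time_blocks(total))  # 2600.0 1
```

With these values the VM runs for 2600 seconds and is charged for one
full block, even though roughly a quarter of the hour is unused.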


Let $P = \{T_{vm_1}, T_{vm_2}, \ldots\}$ be the execution plan, each
item of which is a group of tasks assigned to one $vm \in VM$. Let
$VM_P$ denote the list of VMs used by execution plan $P$. Similarly,
let $L_P$ be the list of locations where all VMs of plan $P$ are
deployed. Moreover, $P_l$ denotes the execution plan for location
$l \in L$, which means $L_{P_l} = \{l\}$ and $VM_{P_l} = VM_l$.

As all VMs are running in parallel, the execution time of a plan is
equal to that of its slowest VM:

$$exec_P = \max_{vm \in VM_P} exec_{vm} \qquad (7)$$

The total number of time blocks used is the sum of the time blocks used
by each VM, represented as:

$$tb_P = \sum_{vm \in VM_P} tb_{vm} \qquad (8)$$

The budget constraint is the amount of money that a user is willing to
pay for executing the BoDT. Even though Cloud providers charge users
for both compute time on virtual machines and data transfer, only the
renting cost is considered, as the amount of downloaded data is
unchanged for any given problem, i.e. regardless of the execution plan,
the same amount of data is downloaded, and thus the data transfer cost
is the same.

The budget constraint is mapped onto the number of allowed time blocks
$tb_b$ by dividing the budget by the cost of one time block (this is
possible because of the assumption that there is only one VM type).
Hence, the problem of maximising the performance of executing a BoDT on
the Cloud with a given budget constraint is to find an execution plan
$P$ that minimises $exec_P$ while keeping $tb_P \le tb_b$ and
satisfying the constraints in Equations 2 and 3.
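
The plan-level quantities of Equations 7 and 8 and the mapping of a
budget onto time blocks can be sketched as follows. The function names
are hypothetical, prices are expressed in integer cents to avoid
floating-point rounding, and rounding the number of allowed blocks down
to a whole number is an assumption of this sketch.

```python
import math

BLOCK_SECONDS = 3600  # one charged time block (one hour)


def plan_exec_time(vm_exec_times):
    """Equation 7: VMs run in parallel, so the slowest VM sets the plan time."""
    return max(vm_exec_times, default=0.0)


def plan_time_blocks(vm_exec_times):
    """Equation 8: total number of charged blocks over all VMs of the plan."""
    return sum(math.ceil(t / BLOCK_SECONDS) for t in vm_exec_times)


def allowed_time_blocks(budget_cents, block_price_cents):
    """Map a budget onto tb_b; only whole blocks can be bought (assumption)."""
    return budget_cents // block_price_cents


def satisfies_budget(vm_exec_times, budget_cents, block_price_cents):
    """Check tb_P <= tb_b for a plan given as a list of VM execution times."""
    allowed = allowed_time_blocks(budget_cents, block_price_cents)
    return plan_time_blocks(vm_exec_times) <= allowed


vm_times = [3000, 3500, 4000]  # illustrative execution times in seconds
print(plan_exec_time(vm_times))             # 4000
print(plan_time_blocks(vm_times))           # 4
print(satisfies_budget(vm_times, 200, 50))  # True
print(satisfies_budget(vm_times, 199, 50))  # False
```

The third VM runs for more than one hour and is therefore charged two
blocks, which is why the plan needs four blocks for three VMs.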


3 ALGORITHMS 

As stated in the previous section, the optimal plan for executing BoDT
on the Cloud with a budget constraint can be found by solving the
mathematical model. However, solving the mathematical model can take a
considerable amount of time since it involves considering multiple
possibilities of assigning tasks to different VMs at different Cloud
locations. In this section, we propose an alternative approach in the
form of a heuristic algorithm for finding an execution plan for a BoDT
based on a user’s budget constraint.

3.1 Select Initial Number of VMs at Each Location

The main idea of the approach presented in this paper is to assign a
set of VMs to each location, then to reduce them until the total number
of VMs across all locations is $tb_b$.

In order to determine the initial number of VMs at each location, we
make the assumption that it is possible to limit each VM to executing
within one time block, i.e. if a VM would need more than one time block
to finish its execution, its tasks can be split and scheduled onto two
VMs. Then, the total number of time blocks is equal to the total number
of VMs across all locations. Thus, the constraint $tb_b$ also limits
the total number of VMs, each of which uses no more than one time
block. Hence, initially, the number of VMs at each location, i.e. the
size of $VM_l$ for $l \in L$, can be set to $tb_b$.

3.2 Find Execution Plan based on Budget Constraint

Let $P_{nl}$ be the plan in which tasks are assigned to their nearest
location, i.e. the location in which $exec_{t,l}$ is minimum. Each item
in $P_{nl}$ represents the list of tasks assigned to a location (not a
VM).


Algorithm 1 Find Execution Plan based on Budget Constraint

 1: function FIND_PLAN(tb_b, P_nl, VM)
 2:   P ← ∅
 3:   for l ∈ L_{P_nl} do
 4:     P_l ← ASSIGN(T_l, VM_l)
 5:     if tb_{P_l} > tb_b then
 6:       FAIL
 7:     end if
 8:     P ← P ∪ P_l
 9:   end for
10:   P ← REDUCE(P, ∅, TRUE)
11:   if tb_P > tb_b then
12:     P ← REDUCE(P, ∅, FALSE)
13:   end if
14:   if tb_P > tb_b then
15:     FAIL
16:   end if
17:   P ← BALANCE(P)
18:   return P
19: end function


Algorithm 1 finds a plan with minimum execution time based on the
budget constraint $tb_b$. The nearest plan $P_{nl}$ and the initial
list of virtual machines $VM$ are provided as input. The algorithm uses
three functions, namely ASSIGN, REDUCE and BALANCE.

First of all, the algorithm assigns tasks to VMs deployed in their
nearest locations (Lines 3 to 9). Line 5 checks if the number of used
time blocks in a location is more than the budget constraint. If that
is the case, then it is impossible to find an execution plan satisfying
the given budget constraint.

Secondly, some VMs are removed by moving their tasks to other ones
until the budget constraint is satisfied (Lines 10 to 13). The
reassignment can be performed between VMs in the same location or
across multiple locations. If, after reducing, the number of VMs is
still higher than $tb_b$, it is impossible to satisfy the budget
constraint (Lines 14 and 15).

Finally, as the execution times of the VMs are different (for example,
one VM can take longer to finish than the others), it is necessary to
balance out the execution times between all VMs so that they can finish
at the same time, thus reducing the overall execution time (Line 17).


3.3 Assign Tasks to VMs


Algorithm 2 Assign Tasks to VMs

 1: function ASSIGN(T', VM')
 2:   T' ← T' sorted by -exec_{t,l} for t ∈ T'
 3:   for t ∈ T' do
 4:     VM_0 ← VM' filtered by exec_vm + exec_{t,vm} < 3600
 5:     if VM_0 = ∅ then
 6:       FAIL
 7:     end if
 8:     VM_0 ← VM_0 sorted by (trans_{t,vm}, exec_vm) for vm ∈ VM_0
 9:     VM_0 ← argmin_{vm ∈ VM_0} trans_{t,vm}
10:     vm ← VM_0[0]
11:     T_vm ← T_vm ∪ {t}
12:   end for
13:   P_nl ← {T_vm for vm ∈ VM'}
14:   return P_nl
15: end function


Algorithm 2 aims to evenly distribute tasks from $T'$ to the set of
receiving VMs.

First of all, tasks are sorted in descending order based on their
execution times (Line 2). Then, for each task, all the VMs which can
execute it without requiring more than one time block are selected
(Line 4). If there is no VM selected, i.e. it would take more than one
time block if the task were assigned to any of the given VMs, the
function fails (Lines 5 and 6).

All the selected VMs are sorted based on the distance between the VM’s
location and the task’s location, and by their current execution time
(Lines 8 and 9). The task is assigned to the first VM in the sorted
collection (Lines 10 and 11). In other words, Algorithm 2 tries to
assign a task to the nearest VM with the lowest execution time.
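
The assignment step can be sketched in Python as below. This is a
minimal illustration rather than the implementation used in the paper:
the names `assign`, `load`, `exec_time` and `trans` are hypothetical,
tasks are ordered by their best-case execution time over the candidate
VMs (an assumption standing in for $exec_{t,l}$), and failure is
signalled with a `ValueError`.

```python
BLOCK_SECONDS = 3600  # one charged time block (one hour)


def assign(tasks, load, exec_time, trans):
    """Greedy assignment in the spirit of Algorithm 2.

    tasks: list of task ids
    load: dict vm -> current execution time in seconds (updated in place)
    exec_time(t, vm): execution time of task t on vm
    trans(t, vm): transfer cost between the data of t and vm
    """
    if not load:
        raise ValueError("no VM available")
    plan = {vm: [] for vm in load}
    # Longest tasks first (descending best-case execution time).
    ordered = sorted(tasks, key=lambda t: -min(exec_time(t, vm) for vm in load))
    for t in ordered:
        # Keep only the VMs that would still finish within one time block.
        fits = [vm for vm in load if load[vm] + exec_time(t, vm) < BLOCK_SECONDS]
        if not fits:
            raise ValueError(f"task {t!r} does not fit into one time block")
        # Nearest VM first; ties are broken by the lowest current load.
        best = min(fits, key=lambda vm: (trans(t, vm), load[vm]))
        plan[best].append(t)
        load[best] += exec_time(t, best)
    return plan


# Illustrative tables only (seconds and per-unit transfer costs).
exec_table = {("a", "vm1"): 1000, ("a", "vm2"): 1500,
              ("b", "vm1"): 3000, ("b", "vm2"): 2500,
              ("c", "vm1"): 800, ("c", "vm2"): 700}
trans_table = {("a", "vm1"): 0.1, ("a", "vm2"): 0.3,
               ("b", "vm1"): 0.4, ("b", "vm2"): 0.2,
               ("c", "vm1"): 0.1, ("c", "vm2"): 0.1}
plan = assign(["a", "b", "c"], {"vm1": 0, "vm2": 0},
              lambda t, vm: exec_table[(t, vm)],
              lambda t, vm: trans_table[(t, vm)])
print(plan)  # {'vm1': ['a', 'c'], 'vm2': ['b']}
```

Task `b` is placed first on its nearest VM; task `a` then no longer
fits on `vm2` within one block, and task `c` is equally close to both
VMs, so the less loaded `vm1` receives it.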














3.4 Reduce the Number of VMs 


Algorithm 3 Reduce VMs

 1: function REDUCE(P, Ign, is_local)
 2:   vm ← arg min exec_vm
 3:   if is_local = TRUE then
 4:     VM' ← VM_{l_vm} - vm
 5:   else
 6:     VM' ← VM_P - vm
 7:   end if
 8:   P' ← ASSIGN(T_vm, VM')
 9:   if tb_{P'} < tb_P then
10:     P ← P'
11:   else
12:     Ign ← Ign ∪ {vm}
13:   end if
14:   if tb_P ≤ tb_b or Ign = VM_P then
15:     return VM_l for l ∈ L
16:   else
17:     return LOCAL_REDUCE(P_n, Ign)
18:   end if
19: end function


Algorithm 3 is used to reduce the number of VMs by moving all tasks
from one VM to others which are either in the same or in different
locations. It is a recursive process which takes the current plan
$P_n$, the list of VMs which cannot be removed from the plan $Ign$, and
the boolean value $is\_local$ indicating whether the reducing process
is applied locally or globally.

First, a VM with the lowest execution time is selected (Line 2). Then
the remaining VMs, which can be either in the same (Line 4) or in
different Cloud locations (Line 6), are selected as receiving VMs.

After that, all tasks from the selected VM are reassigned to the other
VMs (Line 8) by reusing Algorithm 2. Notably, the receiving VMs are not
empty but already contain some tasks.

If the reassignment reduces the number of VMs (Line 9), the current
plan is updated (Line 10). Otherwise, the selected VM is added into the
ignore list $Ign$ (Line 12). If the total number of time blocks
satisfies the given constraint or all VMs are ignored (Line 14), the
process stops and returns the current plan (Line 15); otherwise it
continues (Line 17).
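
The reduction step can be illustrated with the following iterative
Python sketch. It is a simplification under stated assumptions, not the
recursive procedure itself: the names `reduce_vms`, `location` and
`exec_time` are hypothetical, start-up overhead is ignored, each VM is
assumed to use at most one time block (so the number of VMs equals the
number of time blocks), and a task is moved to the least loaded
receiver rather than the nearest one.

```python
BLOCK_SECONDS = 3600  # one charged time block (one hour)


def reduce_vms(plan, exec_time, location, tb_b, is_local):
    """Remove VMs by moving all of their tasks onto the remaining VMs.

    plan: dict vm -> list of tasks; exec_time(t, vm) -> seconds
    location: dict vm -> location name; tb_b: allowed number of VMs
    is_local: if True, tasks may only move to VMs in the same location
    """
    plan = {vm: list(ts) for vm, ts in plan.items() if ts}
    load = {vm: sum(exec_time(t, vm) for t in ts) for vm, ts in plan.items()}
    ignored = set()  # VMs whose tasks could not be moved elsewhere
    while len(plan) > tb_b and ignored != set(plan):
        # Candidate for removal: the least loaded VM that is not ignored.
        victim = min((vm for vm in plan if vm not in ignored), key=load.get)
        receivers = [vm for vm in plan if vm != victim
                     and (not is_local or location[vm] == location[victim])]
        trial, moves = dict(load), []
        for t in sorted(plan[victim], key=lambda t: -exec_time(t, victim)):
            fits = [vm for vm in receivers
                    if trial[vm] + exec_time(t, vm) < BLOCK_SECONDS]
            if not fits:
                moves = None  # this VM cannot be emptied
                break
            target = min(fits, key=trial.get)
            trial[target] += exec_time(t, target)
            moves.append((t, target))
        if moves is None:
            ignored.add(victim)
            continue
        for t, target in moves:
            plan[target].append(t)
        del plan[victim], trial[victim]
        load = trial
    return plan


cost = {"a": 1000, "b": 1500, "c": 500}  # illustrative task times in seconds
start = {"vm1": ["a"], "vm2": ["b"], "vm3": ["c"]}
where = {"vm1": "us-east-1", "vm2": "us-east-1", "vm3": "us-east-1"}
reduced = reduce_vms(start, lambda t, vm: cost[t], where, 2, True)
print(reduced)  # {'vm1': ['a', 'c'], 'vm2': ['b']}
```

In this example the least loaded VM (`vm3`) is emptied onto `vm1`,
after which the plan uses two VMs and the loop stops.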

3.5 Balance Tasks Between VMs 

After the budget constraint is satisfied, the execution times of the
VMs can be uneven, i.e. some VMs can have higher execution times than
others. As the execution time of the plan $exec_P$ is determined by the
VM with the highest execution time, it is necessary to balance out the
execution times between them.


Algorithm 4 Balancing Algorithm

 1: function BALANCE(P)
 2:   vm ← argmax_{vm ∈ VM_P} exec_vm
 3:   T_vm ← T_vm sorted by exec_{t,vm}
 4:   for t ∈ T_vm do
 5:     VM_1 ← (VM_P - {vm}) sorted by trans_{t,vm}
 6:     vm_0 ← NULL
 7:     for vm_1 ∈ VM_1 do
 8:       if t is never in vm_1 AND exec_{vm_1} + exec_{t,vm_1} < exec_vm then
 9:         vm_0 ← vm_1
10:         BREAK
11:       end if
12:     end for
13:     if vm_0 ≠ NULL then
14:       BREAK
15:     end if
16:   end for
17:   if vm_0 ≠ NULL then
18:     T'_vm ← T_vm - t
19:     T'_{vm_0} ← T_{vm_0} ∪ {t}
20:     P ← (P - {T_vm, T_{vm_0}}) ∪ {T'_vm, T'_{vm_0}}
21:     goto 2
22:   end if
23:   return P
24: end function

Algorithm 4 is an iterative process which tries to move tasks from the
VM with the highest execution time (Line 2) to the nearest VM possible.
There are two conditions for selecting a receiving VM: the selected
task has never been assigned to it, and its execution time after
receiving the task is not higher than the current execution time of the
giving VM (Line 8).
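
The balancing loop can be sketched in Python as follows. The names
`balance`, `exec_time`, `trans` and `seen` are hypothetical, start-up
overhead is ignored, and tasks of the slowest VM are tried in ascending
order of execution time, which is an assumption about the sort order.

```python
def balance(plan, exec_time, trans):
    """Repeatedly move one task off the slowest VM to the nearest VM
    that would still finish before the slowest VM currently does."""
    plan = {vm: list(ts) for vm, ts in plan.items()}
    if not plan:
        return plan
    load = {vm: sum(exec_time(t, vm) for t in ts) for vm, ts in plan.items()}
    # Tasks that have ever been on a VM; a task never returns to such a VM,
    # which bounds the number of moves and guarantees termination.
    seen = {vm: set(ts) for vm, ts in plan.items()}
    while True:
        slowest = max(load, key=load.get)
        move = None
        for t in sorted(plan[slowest], key=lambda t: exec_time(t, slowest)):
            # Try the receivers from nearest to farthest.
            receivers = sorted((vm for vm in plan if vm != slowest),
                               key=lambda vm: trans(t, vm))
            for vm in receivers:
                if t not in seen[vm] and load[vm] + exec_time(t, vm) < load[slowest]:
                    move = (t, vm)
                    break
            if move is not None:
                break
        if move is None:
            return plan  # no task can be moved any more
        t, vm = move
        plan[slowest].remove(t)
        plan[vm].append(t)
        seen[vm].add(t)
        load[slowest] -= exec_time(t, slowest)
        load[vm] += exec_time(t, vm)


cost = {"a": 1000, "b": 800, "c": 600, "d": 500}  # illustrative seconds
balanced = balance({"vm1": ["a", "b", "c"], "vm2": ["d"]},
                   lambda t, vm: cost[t], lambda t, vm: 0.0)
print(balanced)  # {'vm1': ['a', 'b'], 'vm2': ['d', 'c']}
```

Here the slowest VM drops from 2400 to 1800 seconds; no further move is
accepted because any additional task would make `vm2` slower than
`vm1`.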

3.6 Dynamic Scheduling To Avoid Idle VM

Even though Algorithm 4 aims to build a plan in which all VMs finish
their execution at nearly the same time, due to the instability of the
network and other unaccountable factors, e.g. service failure, it is
not unusual for one VM to finish before the others. As the cost of a
full hour is already paid, it is necessary to utilise the remaining
time of the finished VMs in order to reduce not only the idle (paid but
unused) time but also the execution time of the other VMs.

Let $rt_{vm}$ be the actual running time of a VM. Let $e_{vm}$ be the
estimated remaining execution time of $vm \in VM$, which is tracked
together with the list of its remaining tasks. $terminate\_time$
denotes the time it takes for a VM to be shut down. Finally, $thr_1$
and $thr_2$ are two threshold values indicating the required remaining
execution time and the required remaining number of tasks. As
unfinished VMs are still running when the reassignment is being
performed, those thresholds aim to avoid reassigning tasks already
executed by one VM to another. The idea of dynamic rescheduling is to
move tasks of a VM to another, finished one while satisfying $thr_1$
and $thr_2$ in order to reduce its $e_{vm}$.


Algorithm 5 Dynamic Reassignment

 1: function REASSIGN(vm)
 2:   if 3600 - rt_vm < terminate_time then
 3:     FAIL
 4:   end if
 5:   VM_1 ← (VM_P - {vm}) sorted by -e_{vm_1} for vm_1 ∈ VM_1
 6:   vm_0 ← NULL
 7:   for vm_1 ∈ VM_1 do
 8:     if e_{vm_1} < thr_1 AND (number of remaining tasks of vm_1) < thr_2 then
 9:       vm_0 ← vm_1
10:       BREAK
11:     end if
12:   end for
13:   if vm_0 = NULL then
14:     FAIL
15:   end if
16:   T_l ← (remaining tasks of vm_0) sorted by trans_{t,vm} for t ∈ T_l
17:   T ← ∅
18:   el ← 3600 - rt_vm - terminate_time
19:   for t ∈ T_l do
20:     exec_T ← exec_T + exec_{t,vm}
21:     if exec_T > e_{vm_0} / 2 OR exec_T > el then
22:       BREAK
23:     end if
24:     T ← T ∪ {t}
25:     T_l ← T_l - {t}
26:   end for
27:
28:   T_vm ← T
29:   TIME_OUT(vm, el)
30: end function

In order to support dynamic scheduling, we add a feature which monitors
the execution of VMs, keeps track of the remaining tasks and execution
times, and detects a VM which has just finished its execution.

Every time a VM has just finished its execution, Algorithm 5 is
invoked. First, it checks whether there is enough time left in the
finished VM to execute some tasks (Line 2). This check ensures that the
finished VM can be terminated before using another time block. Then,
the VM which not only has the highest remaining execution time but also
satisfies $thr_1$ and $thr_2$ is selected (Lines 5 to 13).

After that, some of the tasks are moved from the selected VM to the
finished one until one of the following conditions is met: i) the
execution time of the finished VM is greater than or equal to half of
the remaining execution time of the giving one, or ii) the finished VM
would take more than one time block to finish its execution if more
tasks were added (Lines 19 to 26).

Notably, Algorithm 5 is only invoked for one VM at a time, i.e. if
there are multiple finished VMs, only one of them is reassigned tasks
while the others have to wait.

Finally, the timeout feature is added to prevent the finished VM, which
has just been assigned some more tasks, from using more than one time
block. Basically, it takes the VM and the allowed execution time as
arguments (Line 29); if the VM is still running when the time is out,
it is automatically terminated and the remaining tasks are moved to
another VM with the lowest remaining execution time, i.e. the one that
is likely to finish first.
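
The selection of tasks for a finished VM can be sketched as below. All
names (`pick_tasks`, `donors`, `eligible`, `exec_on_finished`) are
hypothetical. The threshold test on $thr_1$ and $thr_2$ is passed in as
a callable so that the sketch does not fix how the thresholds are
compared, and the remaining tasks of each VM are assumed to be already
ordered by transfer cost to the finished VM.

```python
BLOCK_SECONDS = 3600  # one charged time block (one hour)


def pick_tasks(finished_rt, terminate_time, donors, eligible, exec_on_finished):
    """Choose tasks to move onto a VM that has just finished.

    finished_rt: seconds the finished VM has already been running
    terminate_time: seconds needed to shut a VM down
    donors: dict vm -> (remaining_seconds, remaining_tasks)
    eligible(remaining_seconds, n_tasks): threshold test supplied by the caller
    exec_on_finished(t): estimated time of task t on the finished VM
    Returns (donor, tasks_to_move, allowed_seconds), or None on failure.
    """
    if BLOCK_SECONDS - finished_rt < terminate_time:
        return None  # not enough paid time left to do useful work
    allowed = BLOCK_SECONDS - finished_rt - terminate_time
    candidates = [vm for vm, (rem, ts) in donors.items() if eligible(rem, len(ts))]
    if not candidates:
        return None
    # Donor: the eligible VM with the highest remaining execution time.
    donor = max(candidates, key=lambda vm: donors[vm][0])
    remaining, tasks = donors[donor]
    moved, moved_time = [], 0.0
    for t in tasks:
        new_time = moved_time + exec_on_finished(t)
        # Stop at half of the donor's remaining work or at the block limit.
        if new_time > remaining / 2 or new_time > allowed:
            break
        moved.append(t)
        moved_time = new_time
    return donor, moved, allowed


# Illustrative values; the threshold rule below is an assumption.
donors = {"vm2": (2000, ["a", "b", "c", "d"]), "vm3": (500, ["e"])}
result = pick_tasks(2000, 100, donors,
                    lambda rem, n: rem > 600 and n > 1,
                    lambda t: 400)
print(result)  # ('vm2', ['a', 'b'], 1500)
```

The returned `allowed` value corresponds to the time limit that would
be handed to the timeout mechanism for the finished VM.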


4 EXPERIMENTAL EVALUATION 


4.1 Set-up 


In order to evaluate our proposed approach, we developed a Word Count
application in which each task involved downloading a file from a
remote server and counting the number of words in it. Those files were
located on PlanetLab (PL), a test-bed for distributed computing
experiments (Chun et al., 2003). We had 5700 tasks, i.e. files,
distributed across 38 PL nodes and the total amount of data for each
experiment run was more than 12 gigabytes. The VMs were deployed in 8
different Amazon Web Services (AWS) regions.

Prior to the experiment, we ran a test with fewer tasks in order to
collect the computational cost, i.e. $comp$, and the communication
costs between all AWS regions and PlanetLab nodes (i.e. $trans$).

Based on our algorithm, at least 4 VMs were required to execute all
5700 tasks. We then set $tb_b = \{4, 6, 8, 10, 12, 14, 16, 18, 20\}$,
i.e. the number of time blocks (or VMs) that we wanted to use. For each
value of $tb_b$, we ran the execution three times to find the mean and
standard deviation.

For comparison, we implemented two simple approaches for executing BoDT
on the Cloud:


• Centralised approach: one centralised location was selected as
$l_c = \arg\min_{l \in L} \sum_{t \in T} trans_{t,l}$, i.e. the
location for which the cost of moving all tasks to it was minimum in
comparison with the other locations. This approach was developed based
on the centralised approach introduced in our previous paper (Thai et
al., 2014b); however, instead of using only one VM at the selected
location, in this paper the number of VMs was equal to the one used by
our proposed approach. In other words, this centralised approach
enjoyed the same execution parallelism as the proposed one.


• Round Robin approach: for this approach, all Cloud locations were
sorted in ascending order based on their costs of moving all tasks to
them, which means the first Cloud location was the one selected by the
centralised approach. After that, VMs were added to each location in
circular order, e.g. the first VM was added to the first Cloud location
in the sorted list.


For both approaches, Algorithm 2 was used to evenly distribute tasks to
all VMs.
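
The VM placement of the two baseline approaches can be sketched as
follows. The function names and the `move_cost` table are hypothetical,
and the region names and costs are illustrative values rather than
measured ones.

```python
from itertools import cycle, islice


def rank_locations(move_cost):
    """Sort locations by the total cost of moving every task to them.

    move_cost: dict location -> dict task -> cost of moving that task there
    """
    return sorted(move_cost, key=lambda loc: sum(move_cost[loc].values()))


def centralised_placement(move_cost, n_vms):
    """All VMs are deployed at the single cheapest location."""
    return [rank_locations(move_cost)[0]] * n_vms


def round_robin_placement(move_cost, n_vms):
    """VMs are added to the ranked locations in circular order."""
    return list(islice(cycle(rank_locations(move_cost)), n_vms))


move_cost = {
    "us-east-1": {"a": 1, "b": 5},
    "eu-west-1": {"a": 2, "b": 2},
    "ap-southeast-1": {"a": 4, "b": 4},
}
print(centralised_placement(move_cost, 4))
# ['eu-west-1', 'eu-west-1', 'eu-west-1', 'eu-west-1']
print(round_robin_placement(move_cost, 4))
# ['eu-west-1', 'us-east-1', 'ap-southeast-1', 'eu-west-1']
```

Both baselines rank locations using all tasks at once, so the first
round robin location always coincides with the centralised choice.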
4.2 Dynamic Reassignment


Figure 1: Comparison of execution without and with reassignment
[Bar chart; one bar per VM (ap-southeast-1, ap-southeast-2, us-east-1,
us-west-2), grouped as Without_Reassignment and With_Reassignment.]

Before going into the main experiment, it is necessary to demonstrate
the need for dynamic reassignment for VMs that finish executing their
assigned tasks earlier than others. Figure 1 presents the result of
running the same execution plan with $tb_b = 4$, i.e. there were 4 VMs.
Each bar represents the execution time of a VM. Without reassignment,
one VM took longer to finish its execution, thus increasing the overall
execution time. Dynamic reassignment helped to balance out the
execution time between VMs so that all VMs could finish at about the
same time, which in turn reduced the overall execution time.

For the remainder of the experiments presented in 
this section, dynamic reassignment is applied. 


Figure 2: Execution Times 


4.3 Experimental Results 

Figure 2 presents the execution times for each value of the number of
VMs for all three approaches. The centralised approach had the highest
execution times: even though it selected the location with the lowest
transfer cost for all tasks, some tasks were very far from the Cloud
location, which resulted in high data transfer times. On the other
hand, the round robin approach performed better as it deployed VMs at
multiple Cloud locations, which meant it was possible for tasks to be
executed near their data sources. Finally, it is evident that, with the
same number of VMs (or budget), our approach always had the lowest
execution time, i.e. performed better, in comparison with the other
two.

A reason for the improvement is that our approach not only deployed VMs
at multiple locations but also carefully selected those locations so
that the majority of tasks could be executed near their data sources.
The two simple approaches decided the location(s) of VMs based on all
tasks, by assuming all tasks were assigned to one Cloud location. On
the other hand, our approach took a more fine-grained method by
assigning each task to its nearest location first and then reassigning
tasks to other locations until the budget constraint was satisfied.

As a result, with the same given budget constraint, our approach was
30% to 50% faster than the centralised approach. In comparison to the
round robin approach, ours was able to reduce the execution times by up
to 30%.





















Figure 3: Actual Number of Used Time Blocks, i.e. cost
[Bar chart; series: Decentralised Approach, Centralised Approach, Round
Robin Approach.]


Figure 3 presents the number of actual time blocks, which can be mapped
onto actual cost, consumed by the three approaches. It shows that our
approach was able to satisfy the budget constraint in all cases.
Moreover, when there were 4 VMs, the centralised and round robin
approaches were more expensive than the decentralised one. This was
because each of their VMs required more than one hour to finish
executing all the assigned tasks and the overall execution time was
higher than 3600 seconds, as shown by Figure 2. This means that the
constraint $tb_b = 4$ could only be satisfied by the decentralised
approach.


4.4 Trade-off Between Cost and Performance


As presented in Figure 2, the higher the budget constraint is (i.e. the
more VMs), the better the performance is. In theory, it is possible to
keep adding more VMs in order to achieve better performance. However,
the performance gain for each additional VM decreases as the total
number of VMs increases.

Hence, it is up to the user to decide how much improvement in
performance can be afforded. There are some simple criteria to
consider, such as a defined budget constraint, the desired execution
time, or a threshold on the performance gain (for example, stop adding
more VM(s) if the performance gain is less than 60 seconds).

A user can also make the decision of how many VMs to use based on the
trade-off between performance and cost, as mentioned in (Thai et al.,
2014b).


5 RELATED WORK


In Grid environments, in which the resources are shared between
multiple organisations, the authors of (Ranganathan and Foster, 2002)
were able to improve the overall performance of a distributed framework
by processing data in close proximity to where it resided. Similarly,
the authors of (Kaya and Aykanat, 2006) proposed a heuristic algorithm
to improve performance in executing independent but file-sharing tasks.
In (Venugopal and Buyya, 2005), the authors assumed that each task
required data which was distributed at multiple sources and proposed
the auto-scaling algorithm to satisfy both deadline and budget
constraints.

However, the applicability of Grid computing research to Cloud
computing is limited because: i) Cloud resources are (virtually)
unlimited, hence a user is free to add or remove VMs whenever she
wants, but ii) the monetary cost has to be considered as the resources
are no longer free of charge.

Recently, running applications on the Cloud has received attention from
many researchers. Statistical learning has been used to schedule the
execution of BoT on the Cloud (Oprescu and Kielmann, 2010). A method
for scaling resources based on a given budget constraint and desired
application performance was also investigated (Mao et al., 2010).
Nevertheless, those papers did not consider the location of data.

Cloud computing is employed for improving the performance of
data-intensive applications, such as Hadoop, whose data is globally
located (Ryden et al., 2014). Research that takes geographical distance
into account while executing workflows is reported in (Luckeneder and
Barker, 2013; Thai et al., 2014a). However, recent research on applying
Cloud computing to applications with geographically distributed data
focuses only on improving performance without considering the monetary
cost.

Our previous work (Thai et al., 2014b) aimed to determine a plan for
executing BoDT on the Cloud; however, it made the assumption that only
one VM could be deployed at each Cloud region.


Our paper differentiates itself from prior research by taking advantage
of the decentralised infrastructure of Cloud computing in executing
BoDT applications. We try to decide not only the amount of resources
but also the locations where resources, i.e. VMs, must be located.
Moreover, our research exploits the virtually unlimited resources of
Cloud computing by letting a user decide how many resources she wants
based on her budget. Finally, the trade-off between performance gain
and additional cost is also presented.